This project uses the World Bank’s Education Statistics dataset to explore long-term trends and relationships among key indicators of the US education system. Our goal is to understand how structural aspects of education (access, staffing, and outcomes) have evolved over time and how changes in one variable may influence changes in another. We analyze variables across different categories and examine differences between sexes. By focusing on longitudinal data, we aim to identify potential patterns and meaningful interactions that could inform educational policy and investment.
The analysis of various indicators allows us to examine temporal
changes and potential external factors that might shape educational
outcomes.
We selected two major indicator categories that reflect
different dimensions of the education system:
Enrollment Rate vs. Attendance Rate
We will compare enrollment and attendance rates across different levels
of schooling (primary, secondary, tertiary) to test whether higher
enrollment leads to consistent attendance over time.
Number of Teachers vs. Teaching Staff Compensation
We will investigate how changes in staffing levels correspond with
compensation across educational stages.
When we were brainstorming which dataset to work with, we had difficulty choosing a topic because the possibilities were endless. We considered several options, from public health to climate to economics, but eventually narrowed it down to education. This is something that all of us are connected to and that our classmates and readers can easily relate to. We initially considered focusing on Emory-specific data but broadened our scope to the World Bank’s Education Statistics for its larger context and long-term coverage.
As students at an American university, we were especially interested in examining data trends within the United States. Education policy and access here have evolved significantly over time, and exploring long-term patterns in enrollment, attendance, and staffing can reveal whether increased investment in education translates into better participation and equity. This directly relates to us, as college students who have undergone the three main levels of education and were told that our own efforts, most notably in high school, would determine our outcomes later in life. But how much do individual outcomes actually depend on the student, and how much on the school systems and structures that support them?
We will perform Exploratory Data Analysis (EDA) in R
using the tidyverse suite of libraries along with some
supplementary packages for smoother visualization.
Our analysis includes data preparation, thematic subsetting, summary statistics, correlation and regression checks, and static and interactive visualization.
Before proceeding, we install and load all necessary packages, then set global options for reproducibility and consistent output formatting. We also create global functions that will streamline table formatting for data outputs of interest.
# install.packages("kableExtra")
# install.packages("tidyverse")
# install.packages("readr")
# install.packages("knitr")
# install.packages("plotly")
# install.packages("rmarkdown")
# install.packages("broom")
# install.packages("ggpubr")
# install.packages("janitor")
library(kableExtra) # styled HTML tables
library(tidyverse) # core data manipulation & visualization
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr) # efficient CSV import
library(knitr) # chunk options & simple tables
library(plotly) # interactive data visualization
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(janitor) # column name cleaning & simple tabulations
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(lubridate) # date parsing and time handling
library(rmarkdown) # paged_table() for paginated output
library(broom) # tidy model summaries (glance)
library(ggpubr) # publication-ready ggplot helpers
# helper functions to format output tables
pretty_r2_table <- function(df, caption_text){
  df %>%
    kable("html", caption = caption_text) %>%
    kable_styling(full_width = FALSE) %>%
    row_spec(0, bold = TRUE) %>%
    column_spec(3, color = ifelse(df$p.value < 0.05, "red", "black")) # p-value column
}
pretty_cor_table <- function(x, y, var1_name = NULL, var2_name = NULL) {
  cor_res <- cor.test(x, y)
  tibble(
    Variable1 = var1_name,
    Variable2 = var2_name,
    Correlation = round(cor_res$estimate, 3),
    p.value = signif(cor_res$p.value, 3)) %>%
    kable("html", caption = paste0("Correlation: ", var1_name, " vs ", var2_name)) %>%
    kable_styling(full_width = FALSE) %>%
    row_spec(0, bold = TRUE) %>%
    column_spec(4, color = ifelse(cor_res$p.value < 0.05, "red", "black"))
}
set.seed(123)
opts_chunk$set(echo=TRUE, message=FALSE, warning=FALSE)
This dataset is very large and contains data from many countries. As previously mentioned, we focus only on the US data. We start by reading the raw EdStats dataset (EdStats_v01.csv) and keeping only United States rows (Country code == "USA"). We also drop fully empty columns (the 2024 column), as they will not be needed in our analysis. As a check, we print any columns that still contain all NA values and proceed only if none are reported (character(0)). Finally, we write the filtered dataset to EdStats_USA.csv so that later chunks can work with the US data without filtering for it multiple times.
This simplifies the dataset to the country of interest and removes all-NA columns to keep downstream transformations clean.
#set up data for use
#read data
data <- read_csv("EdStats_v01.csv")
# filter for only USA data, remove NA cols
data_clean <- data %>%
filter(grepl("USA", `Country code`)) %>%
select(-`2024`) #all na
#make sure no NA cols remain
colnames(data_clean)[colSums(is.na(data_clean)) == nrow(data_clean)]
## character(0)
#save filtered dataset
write_csv(data_clean, "EdStats_USA.csv")
To keep downstream steps tidy and reproducible, we create two thematic subsets aligned to our research questions: EdStats_attend.csv (enrollment and attendance rates) and EdStats_teacher.csv (teacher counts and compensation). This ensures we work with clean, directly comparable variables.
# 1. Enrollment vs. Attendance
# filter for enrollment/attendance info
data <- read_csv("EdStats_USA.csv")
data_filter_attend <- data %>%
  filter(grepl("total net enrolment rate|total net attendance rate", `Indicator name`, ignore.case = TRUE)) %>%
  filter(!grepl("female|male", `Indicator name`)) %>%
  filter(!grepl("adjusted gender parity index", `Indicator name`, ignore.case = TRUE)) %>%
  select(where(~ !all(is.na(.)))) %>% # drop columns empty for these indicators
  write_csv("EdStats_attend.csv")
# 2. Teachers vs. Compensation
#filter for num teachers and teaching staff compensation info
data <- read_csv("EdStats_USA.csv")
data_filter_teacher <- data %>%
  filter(grepl("teachers in|teaching staff compensation", `Indicator name`, ignore.case = TRUE)) %>%
  filter(!grepl("female|male", `Indicator name`)) %>%
  filter(!grepl("percentage of qualified", `Indicator name`, ignore.case = TRUE)) %>%
  select(where(~ !all(is.na(.)))) %>% # drop columns empty for these indicators
  write_csv("EdStats_teacher.csv")
The following chunk expands our analysis of
EdStats_attend.csv to quantify and visualize the
relationship between school enrollment and
attendance across education levels in the United
States.
After reshaping the dataset into tidy long form, we label each series by Type (Enrollment or Attendance) and school Level, and collapse duplicate series to obtain a single observation for each year × level × type combination.
From these, we compute a new variable representing the attendance gap (Enrollment − Attendance) to measure how much access (enrollment) translates into sustained participation (attendance). Summary statistics are exported to tendency_attend.csv for reproducibility.
Reading tip: Focus on whether attendance consistently trails enrollment and whether that gap changes across levels or decades. High correlation indicates structural progress, with both metrics rising together, while a persistent positive gap suggests that access alone does not guarantee participation.
data_attend <- read_csv("EdStats_attend.csv")
# drop sparsely observed years, keeping those with the most observations
data_attend_years <- data_attend %>%
  select(!c(`1975`, `1986`, `1987`, `1990`, `1991`, `1993`, `1994`, `1995`, `1996`, `1999`))
#reshape data for visualization
data_long <- data_attend_years %>%
pivot_longer(cols=`2005`:`2022`, names_to="Year", values_to="Value") %>%
mutate(
Year=as.numeric(Year),
Type=case_when(
grepl("attendance", `Indicator name`, ignore.case=TRUE) ~ "Attendance",
grepl("enrolment", `Indicator name`, ignore.case=TRUE) ~ "Enrollment",
TRUE ~ "Other"
),
Level=case_when(
grepl("primary", `Indicator name`, ignore.case=TRUE) ~ "Elementary School",
grepl("lower secondary", `Indicator name`, ignore.case=TRUE) ~ "Middle School",
grepl("upper secondary", `Indicator name`, ignore.case=TRUE) ~ "High School",
TRUE ~ "Other"
)
) %>%
filter(Type!="Other", !is.na(Value))
data_wide <- data_long %>%
pivot_wider(
names_from=Type,
values_from=Value
)
data_combined <- data_wide %>%
  group_by(Level, Year) %>%
  summarise(
    # average across duplicate series so each Level x Year has one value
    Attendance = mean(Attendance, na.rm = TRUE),
    Enrollment = mean(Enrollment, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  # keep only Level x Year cells where both measures exist
  filter(!is.nan(Attendance), !is.nan(Enrollment))
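The attendance gap described in the text (Enrollment − Attendance) appears only in the commented-out scaffolding below; as a minimal sketch of that derivation, using a toy tibble in place of `data_combined` (the column names `Enrollment` and `Attendance` match the combined data above), the computation is a single `mutate()`:

```r
library(dplyr)

# Toy stand-in for data_combined: one row per Level x Year
toy <- tibble::tibble(
  Level      = c("Elementary School", "High School"),
  Year       = c(2015, 2015),
  Enrollment = c(95.2, 91.0),
  Attendance = c(93.8, 88.5)
)

# Gap in percentage points; positive values mean enrollment exceeds attendance
gap_df <- toy %>%
  mutate(Gap = Enrollment - Attendance)
```

Applied to `data_combined`, the same `mutate()` call would produce the gap series used by the (currently commented) gap-over-time plot.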
# 2) Collapse duplicates within (Year, Level, Type)
# (If the catalog has multiple series for the same concept, average them.)
# collapsed <- data_long %>%
# group_by(Level, Year, Type) %>%
# summarize(Value = mean(Value, na.rm = TRUE), .groups = "drop")
# 3) Wide format with both columns side-by-side
# gap_df <- collapsed %>%
# pivot_wider(names_from = Type, values_from = Value) %>%
# mutate(Gap = Enrollment - Attendance)
# Quick diagnostics (see what exists)
# print(table(gap_df$Level, useNA = "ifany"))
# print(summary(gap_df[, c("Enrollment", "Attendance")]))
#
# # If there are no overlapping rows, bail out gracefully
# if (nrow(drop_na(gap_df, Enrollment, Attendance)) < 2) {
# warning("No overlapping Enrollment & Attendance observations after cleaning; skipping correlation and scatter.")
# } else {
# # Correlation and scatter only when data exist
# corr_df <- gap_df %>% drop_na(Enrollment, Attendance)
#
# r_val <- cor(corr_df$Enrollment, corr_df$Attendance, use = "pairwise.complete.obs", method = "pearson")
# message(sprintf("Correlation between Enrollment and Attendance: r = %.2f", r_val))
#
# p_scatter <- ggplot(corr_df, aes(Enrollment, Attendance, color = Level)) +
# geom_point(alpha = 0.6) +
# geom_smooth(method = "lm", se = FALSE, linewidth = 0.9) +
# labs(
# title = sprintf("Attendance vs. Enrollment (r = %.2f)", r_val),
# x = "Enrollment Rate (%)", y = "Attendance Rate (%)"
# ) +
# theme_minimal(base_size = 12)
# print(p_scatter)
# }
model <- lm(Enrollment ~ Attendance, data=data_combined)
# summary(model) #not significant
by_year_r2 <- data_combined %>%
group_by(Year) %>%
do(glance(lm(Enrollment ~ Attendance, data=.))) %>%
select(Year,r.squared,p.value)
# by_year_r2 #not significant
pretty_r2_table(by_year_r2, "R² and p-values by Year (Red = significant, Black = not significant)")
| Year | r.squared | p.value |
|---|---|---|
| 2015 | 0.9997303 | 0.0104555 |
| 2016 | 0.9659338 | 0.1181787 |
| 2017 | 0.9330262 | 0.1666495 |
| 2020 | 0.3520319 | 0.5956316 |
| 2021 | 0.0093777 | 0.9382539 |
by_schooling_r2 <- data_combined %>%
group_by(Level) %>%
do(glance(lm(Enrollment ~ Attendance, data=.))) %>%
select(Level,r.squared,p.value)
# by_schooling_r2 #not significant
pretty_r2_table(by_schooling_r2, "R² and p-values by School Level (Red = significant, Black = not significant)")
| Level | r.squared | p.value |
|---|---|---|
| Elementary School | 0.1355655 | 0.5420218 |
| High School | 0.2899115 | 0.3491842 |
| Middle School | 0.0677148 | 0.6724550 |
p_line_per_school_horizontal <- ggplot(data_long, aes(x=Year, y=Value, color=Type)) +
geom_line(linewidth=1.2, na.rm=TRUE) +
geom_point(size=2, alpha=0.8) +
facet_grid(~Level, scales="fixed", switch="y") +
theme_minimal() +
theme(
strip.background=element_rect(fill="grey90", color=NA),
panel.spacing=unit(1, "lines")
) +
labs(
title="Attendance vs. Enrollment Trends in Schools Over Time (USA)",
y="Rate (%)",
color="Variable"
) +
scale_color_manual(values=c("Attendance"="#FF69B4", "Enrollment"="#3bbf8f")
)
# ggsave("line_per_school_horizontal.png", p_line_per_school_horizontal)
p_line_interactive1 <- ggplotly(p_line_per_school_horizontal)
p_line_interactive1
# 5) Gap over time (only where both exist)
# if (any(!is.na(gap_df$Gap))) {
# p_gap <- ggplot(gap_df, aes(Year, Gap, color = Level)) +
# geom_hline(yintercept = 0, linetype = "dashed") +
# geom_line(linewidth = 1, na.rm = TRUE) +
# geom_point(size = 1.3, alpha = 0.8) +
# labs(title = "Enrollment – Attendance Gap (percentage points)", y = "Gap (pp)", x = "Year") +
# theme_minimal(base_size = 12)
# print(p_gap)
# }
eqn_text <- paste0(
"Trendline: y = ",
round(coef(model)[2],3),"x + ",
round(coef(model)[1],3),
"<br>R² = ",round(summary(model)$r.squared,3),
"<br>p = ",signif(summary(model)$coefficients[2,4],3)
)
enroll_attend_scatter <- ggplot(data_combined, aes(x=Attendance, y=Enrollment)) +
geom_point(aes(color=factor(Year)), size=2) +
geom_smooth(method=lm, aes(group=1, text=eqn_text)) +
theme_minimal() +
labs(title="All Attendance vs Enrollment Observations",
x="Attendance",
y="Enrollment",
color="Year")
enroll_attend_scatter_plotly <- ggplotly(enroll_attend_scatter, tooltip=c("text","x","y","color"))
enroll_attend_scatter_plotly
enroll_longitudinal <- ggplot(data_combined, aes(x=Year, y=Enrollment, color=Level)) +
  geom_point(size=2) +
  geom_smooth(method=lm, linewidth=1, alpha=0.25) +
  theme_minimal() +
  labs(title="Enrollment Observations Longitudinally",
       x="Year",
       y="Enrollment",
       color="Level")
enroll_longitudinal_plotly <- ggplotly(enroll_longitudinal)
enroll_longitudinal_plotly
pretty_cor_table(data_combined$Year, data_combined$Enrollment, "Year", "Enrollment")
| Variable1 | Variable2 | Correlation | p.value |
|---|---|---|---|
| Year | Enrollment | 0.099 | 0.725 |
attend_longitudinal <- ggplot(data_combined, aes(x=Year, y=Attendance, color=Level)) +
  geom_point(size=2) +
  geom_smooth(method=lm, linewidth=1, alpha=0.25) +
  theme_minimal() +
  labs(title="Attendance Observations Longitudinally",
       x="Year",
       y="Attendance",
       color="Level")
attend_longitudinal_plotly <- ggplotly(attend_longitudinal)
attend_longitudinal_plotly
pretty_cor_table(data_combined$Year, data_combined$Attendance, "Year", "Attendance")
| Variable1 | Variable2 | Correlation | p.value |
|---|---|---|---|
| Year | Attendance | -0.594 | 0.0196 |
# 6) Summary stats (by Type x Level, from the tidy long data)
summary_stats <- data_long %>%
group_by(Type, Level) %>%
summarise(
n = n(),
Mean = mean(Value, na.rm=TRUE),
Median = median(Value, na.rm=TRUE),
SD = sd(Value, na.rm=TRUE),
Variance = var(Value, na.rm=TRUE),
Minimum = min(Value, na.rm=TRUE),
Maximum = max(Value, na.rm=TRUE),
Range = Maximum - Minimum,
Q1 = quantile(Value, 0.25, na.rm=TRUE),
Q3 = quantile(Value, 0.75, na.rm=TRUE),
IQR = Q3 - Q1
) %>%
arrange(Type, Level)
paged_table(summary_stats)
write_csv(summary_stats, "tendency_attend.csv")
This section examines how teacher workforce size and
compensation have evolved in the United States from
2014 to 2021.
Using EdStats_teacher.csv, we identify indicators relating
to the number of teachers and the share of
educational expenditure allocated to teaching staff
compensation.
After filtering out non-teaching staff and qualification percentages, the data are reshaped into tidy long form with two new analytical variables, Type and Level. Teacher counts are scaled per 10,000 to make them visually comparable to percentages.
We then collapse duplicate series within each Year × Level × Type combination, visualize trends and data coverage, and export summary statistics to tendency_teachers.csv for reproducibility.
Reading tip: Focus on whether the compensation share rises or falls with the number of teachers. Parallel upward trends may indicate sustained investment in instructional personnel, while divergence could suggest funding reallocation, changing class sizes, or reporting gaps across education levels.
# 1) Read & filter (keep only teaching staff; drop qualification % series)
teacher_raw <- read_csv("EdStats_teacher.csv", show_col_types = FALSE) %>%
filter(!str_detect(`Indicator name`, regex("non-?teaching staff", ignore_case = TRUE))) %>%
filter(!str_detect(`Indicator name`, regex("percentage of qualified", ignore_case = TRUE)))
# 2) Long format (auto-detect year columns) + standardized labels
teachers_long0 <- teacher_raw %>%
pivot_longer(cols = matches("^\\d{4}$"), names_to = "Year", values_to = "Value") %>%
mutate(
Year = as.numeric(Year),
Type = case_when(
str_detect(`Indicator name`, regex("compensation", ignore_case = TRUE)) ~
"Compensation % of Total Expenditure",
str_detect(`Indicator name`, regex("teachers? in|number of teachers", ignore_case = TRUE)) ~
"Number of Teachers",
TRUE ~ NA_character_
),
Level = case_when(
str_detect(`Indicator name`, regex("\\bpre-?primary\\b", ignore_case = TRUE)) ~ "Pre-Primary",
str_detect(`Indicator name`, regex("\\bprimary\\b", ignore_case = TRUE)) ~ "Primary",
str_detect(`Indicator name`, regex("lower\\s*secondary", ignore_case = TRUE)) ~ "Lower Secondary",
str_detect(`Indicator name`, regex("upper\\s*secondary", ignore_case = TRUE)) ~ "Upper Secondary",
str_detect(`Indicator name`, regex("\\bsecondary\\b", ignore_case = TRUE)) ~ "Secondary (unspecified)",
str_detect(`Indicator name`, regex("\\btertiary\\b|post-?secondary|higher education", ignore_case = TRUE)) ~ "Tertiary",
TRUE ~ NA_character_
)
) %>%
filter(!is.na(Type), !is.na(Level), !is.na(Value))
# 3) Collapse duplicates within (Year, Level, Type) to ensure 1 row per cell
teachers_long <- teachers_long0 %>%
group_by(Level, Year, Type) %>%
summarise(Value = mean(Value, na.rm = TRUE), .groups = "drop")
# 4) Scale teacher counts to 10,000s for visual comparability with percentages
teachers_long <- teachers_long %>%
mutate(Value_scaled = if_else(Type == "Number of Teachers", Value / 10000, Value))
# (Optional) order levels for consistent faceting
lvl_order <- c("Pre-Primary", "Primary", "Lower Secondary", "Upper Secondary",
"Secondary (unspecified)", "Tertiary")
teachers_long <- teachers_long %>%
mutate(Level = factor(Level, levels = intersect(lvl_order, unique(Level))))
# ---- A. Time-series trends by level (scaled) ----
p_line_teacher <- ggplot(teachers_long, aes(x = Year, y = Value_scaled, color = Type)) +
geom_line(linewidth = 1.1, na.rm = TRUE) +
geom_point(size = 1.7, alpha = 0.8) +
facet_wrap(~ Level, ncol = 3, scales = "fixed") +
theme_minimal(base_size = 12) +
theme(strip.background = element_rect(fill = "grey95", color = NA)) +
scale_color_manual(
  # name the labels explicitly so they pair with the right series
  values = c("Number of Teachers" = "#FF69B4",
             "Compensation % of Total Expenditure" = "#3BBF8F"),
  labels = c("Number of Teachers" = "Number of Teachers (per 10,000)",
             "Compensation % of Total Expenditure" = "Compensation (% of total)")
) +
labs(
title = "Teachers & Compensation Trends by Level (USA)",
y = "Scaled value (Teachers per 10,000; Compensation %)",
color = "Series"
) +
scale_x_continuous(breaks = pretty(unique(teachers_long$Year)))
p_line_teacher
# Interactive view (kept optional for HTML)
ggplotly(p_line_teacher)
# ---- B. Coverage diagnostics (counts by Type / Year / Level) ----
p_count_type <- ggplot(teachers_long, aes(x = Type)) +
geom_bar(fill = "#c562f0") +
theme_minimal(base_size = 12) +
labs(title = "Coverage: Count of Observations by Type", x = "Series", y = "Count")
p_count_type
p_count_year <- ggplot(teachers_long, aes(x = factor(Year))) +
geom_bar(fill = "#c562f0") +
theme_minimal(base_size = 12) +
labs(title = "Coverage: Observations per Year (all levels)", x = "Year", y = "Count")
p_count_year
p_count_level <- ggplot(teachers_long, aes(x = Level, fill = Type)) +
geom_bar(position = "dodge") +
theme_minimal(base_size = 12) +
labs(title = "Coverage: Observations by Level and Type", x = "Level", y = "Count", fill = "Series")
p_count_level
# ---- C. Correlation between staffing and compensation (where both exist) ----
# Wide table on the *scaled* values so axes are comparable
wide_tc <- teachers_long %>%
select(Level, Year, Type, Value_scaled) %>%
pivot_wider(names_from = Type, values_from = Value_scaled)
# overall (pooled) correlation with guard
corr_tbl <- tibble::tibble()
valid_overall <- wide_tc %>%
mutate(valid = complete.cases(`Number of Teachers`, `Compensation % of Total Expenditure`)) %>%
filter(valid)
if (nrow(valid_overall) >= 3) {
overall_r <- cor(valid_overall$`Number of Teachers`,
valid_overall$`Compensation % of Total Expenditure`,
use = "complete.obs", method = "pearson")
corr_tbl <- bind_rows(corr_tbl, tibble(Level = "All levels (pooled)", r = overall_r))
}
# by-level correlations (no cur_data(); compute within each group)
by_level_r <- wide_tc %>%
group_by(Level) %>%
summarise(
n_pairs = sum(complete.cases(`Number of Teachers`, `Compensation % of Total Expenditure`)),
r = ifelse(
n_pairs >= 3,
cor(`Number of Teachers`, `Compensation % of Total Expenditure`,
use = "complete.obs", method = "pearson"),
NA_real_
),
.groups = "drop"
) %>%
filter(!is.na(r))
corr_tbl <- bind_rows(corr_tbl, by_level_r %>% select(Level, r))
corr_tbl
## # A tibble: 6 × 2
## Level r
## <chr> <dbl>
## 1 All levels (pooled) -0.313
## 2 Pre-Primary -0.571
## 3 Primary 0.551
## 4 Lower Secondary -0.830
## 5 Upper Secondary -0.862
## 6 Secondary (unspecified) -0.334
# ---- D. Summary statistics (per Type × Level) ----
summary_stats_teachers <- teachers_long %>%
group_by(Type, Level) %>%
summarise(
n = n(),
mean_value = mean(Value, na.rm = TRUE),
median_value = median(Value, na.rm = TRUE),
sd_value = sd(Value, na.rm = TRUE),
min_value = min(Value, na.rm = TRUE),
max_value = max(Value, na.rm = TRUE),
IQR = IQR(Value, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(Type, Level)
write_csv(summary_stats_teachers, "tendency_teachers.csv")
summary_stats_teachers
## # A tibble: 12 × 9
## Type Level n mean_value median_value sd_value min_value max_value
## <chr> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Compensatio… Pre-… 6 49.0 49.0 6.83e-1 48.1 49.8
## 2 Compensatio… Prim… 6 49.0 49.0 6.83e-1 48.1 49.8
## 3 Compensatio… Lowe… 6 49.0 49.0 6.83e-1 48.1 49.8
## 4 Compensatio… Uppe… 6 49.0 49.0 6.83e-1 48.1 49.8
## 5 Compensatio… Seco… 6 46.8 46.7 9.30e-1 46.1 48.7
## 6 Compensatio… Tert… 6 28.5 28.3 7.90e-1 27.6 30.0
## 7 Number of T… Pre-… 12 531226. 608720. 1.40e+5 302255 651997.
## 8 Number of T… Prim… 17 1604827. 1687937 1.49e+5 1414000 1774348.
## 9 Number of T… Lowe… 14 810580. 858595. 8.81e+4 670172 894725.
## 10 Number of T… Uppe… 14 774670. 813548. 7.75e+4 650603 848440.
## 11 Number of T… Seco… 15 1501112. 1661375 2.95e+5 784414. 1737206.
## 12 Number of T… Tert… 22 853184. 747500 3.14e+5 574000 1581424
## # ℹ 1 more variable: IQR <dbl>
These observations are preliminary and will guide further analysis.
Reproducibility Notes
All intermediate datasets are written to disk
(EdStats_USA.csv, EdStats_attend.csv,
EdStats_teacher.csv, tendency_attend.csv, and
tendency_teachers.csv).
This allows others to review and re-run later steps without repeating
earlier processing.
Consider adding sessionInfo() at the end to document
package versions.
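A minimal closing chunk along those lines (base R only, no further assumptions) would be:

```r
# Record R version, OS, and attached package versions for reproducibility
sessionInfo()
```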